%load_ext dotenv
%dotenv dev.env
import pandas as pd
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
%matplotlib inline
Intro to ValidMind
ValidMind Python Library Introduction
Initializing the ValidMind Library
After creating an account with ValidMind, we can find the project’s API key and secret in the settings page of the ValidMind dashboard.
The library credentials can be configured in two ways:
- By setting the `VM_API_KEY` and `VM_API_SECRET` environment variables, or
- By passing `api_key` and `api_secret` arguments to the `init` function like this:
vm.init(
api_key='<your-api-key>',
api_secret='<your-api-secret>',
project="cl2r3k1ri000009jweny7ba1g"
)
The project argument is mandatory since it allows the library to associate all collected data with a specific account project.
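For the environment-variable route, a minimal sketch might look like the following (the credential values are placeholders; substitute your own from the ValidMind dashboard settings page):

```python
import os

# Placeholder credentials -- replace with the values from your ValidMind settings page
os.environ["VM_API_KEY"] = "<your-api-key>"
os.environ["VM_API_SECRET"] = "<your-api-secret>"

# vm.init() can then be called without api_key/api_secret arguments;
# the library reads the credentials from the environment instead.
```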
import validmind as vm
vm.init(
api_host = "http://localhost:3000/api/v1/tracking",
project = "clhdxzbb700020a8hpu126rq0"
)
Connected to ValidMind. Project: Customer Churn Model - Initial Validation (clhdxzbb700020a8hpu126rq0)
Using a demo dataset
For this simple demonstration, we will use the following bank customer churn dataset from Kaggle: https://www.kaggle.com/code/kmalit/bank-customer-churn-prediction/data.
We will train a sample model and demonstrate the following library functionalities:
- Logging information about a dataset
- Running data quality tests on a dataset
- Logging information about a model
- Logging training metrics for a model
- Running model evaluation tests
Running a data quality test plan
We will now run the default data quality test plan that will collect the following metadata from a dataset:
- Field types and descriptions
- Descriptive statistics
- Data distribution histograms
- Feature correlations
and will run a collection of data quality tests such as:
- Class imbalance
- Duplicates
- High cardinality
- Missing values
- Skewness
ValidMind evaluates if the data quality metrics are within expected ranges. These thresholds or ranges can be further configured by model validators.
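To make the intent of these checks concrete, here is a rough, standalone sketch of the underlying logic in plain pandas; the threshold value is illustrative only and not one of ValidMind's defaults:

```python
import pandas as pd

# Tiny synthetic frame, purely for illustration
df = pd.DataFrame({
    "Balance": [0.0, 100.0, 250.0, 0.0, 90.0, 110.0],
    "Exited":  [0, 0, 0, 0, 1, 0],
})

# Class imbalance: share of the minority class in the target column
minority_share = df["Exited"].value_counts(normalize=True).min()
imbalance_flag = minority_share < 0.20  # illustrative threshold, not ValidMind's

# Duplicates: count of fully duplicated rows
n_duplicates = int(df.duplicated().sum())

# Missing values: fraction of NaNs per column
missing_fraction = df.isna().mean()

# Skewness: per-column skew of a numerical feature
balance_skew = df["Balance"].skew()

print(bool(imbalance_flag), n_duplicates)  # True 1
```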
Load our demo dataset
Before running the test plan, we must first load the dataset into a Pandas DataFrame and initialize a ValidMind dataset object:
df = pd.read_csv("./datasets/bank_customer_churn.csv")
vm_dataset = vm.init_dataset(
dataset=df,
target_column="Exited",
class_labels={
"0": "Did not exit",
"1": "Exited",
}
)
Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...
Initialize and run the TabularDataset test plan
We can now initialize the TabularDataset test plan. The primary method of doing this is with the run_test_plan function from the vm module. This function takes in a test plan name (in this case tabular_dataset) and a dataset keyword argument (the vm_dataset object we created earlier):
tabular_plan = vm.run_test_plan("tabular_dataset", dataset=vm_dataset)
Results for Tabular Dataset Description Test Plan:
Logged the following dataset to the ValidMind platform:
| | RowNumber | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8000.000000 | 8.000000e+03 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 |
| mean | 5020.520000 | 1.569047e+07 | 650.159625 | 38.948875 | 5.033875 | 76434.096511 | 1.532500 | 0.702625 | 0.519875 | 99790.187959 | 0.202000 |
| std | 2885.718516 | 7.190247e+04 | 96.846230 | 10.458952 | 2.885267 | 62612.251258 | 0.580505 | 0.457132 | 0.499636 | 57520.508892 | 0.401517 |
| min | 1.000000 | 1.556570e+07 | 350.000000 | 18.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 11.580000 | 0.000000 |
| 25% | 2518.750000 | 1.562816e+07 | 583.000000 | 32.000000 | 3.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 50857.102500 | 0.000000 |
| 50% | 5036.500000 | 1.569014e+07 | 651.500000 | 37.000000 | 5.000000 | 97263.675000 | 1.000000 | 1.000000 | 1.000000 | 99504.890000 | 0.000000 |
| 75% | 7512.250000 | 1.575238e+07 | 717.000000 | 44.000000 | 8.000000 | 128044.507500 | 2.000000 | 1.000000 | 1.000000 | 149216.320000 | 0.000000 |
| max | 10000.000000 | 1.581566e+07 | 850.000000 | 92.000000 | 10.000000 | 250898.090000 | 4.000000 | 1.000000 | 1.000000 | 199992.480000 | 1.000000 |
Logged the following dataset metric to the ValidMind platform:
{'numerical': [{'Name': 'RowNumber', 'Count': 8000.0, 'Mean': 5020.52, 'Std': 2885.7185155986554, 'Min': 1.0, '25%': 2518.75, '50%': 5036.5, '75%': 7512.25, '90%': 9015.1, '95%': 9516.05, 'Max': 10000.0}, {'Name': 'CustomerId', 'Count': 8000.0, 'Mean': 15690474.465625, 'Std': 71902.473335347, 'Min': 15565701.0, '25%': 15628163.75, '50%': 15690143.5, '75%': 15752378.25, '90%': 15790809.1, '95%': 15802760.55, 'Max': 15815660.0}, {'Name': 'CreditScore', 'Count': 8000.0, 'Mean': 650.159625, 'Std': 96.84623014808636, 'Min': 350.0, '25%': 583.0, '50%': 651.5, '75%': 717.0, '90%': 778.0, '95%': 813.0, 'Max': 850.0}, {'Name': 'Age', 'Count': 8000.0, 'Mean': 38.948875, 'Std': 10.458952382767269, 'Min': 18.0, '25%': 32.0, '50%': 37.0, '75%': 44.0, '90%': 53.0, '95%': 60.0, 'Max': 92.0}, {'Name': 'Tenure', 'Count': 8000.0, 'Mean': 5.033875, 'Std': 2.885267419215253, 'Min': 0.0, '25%': 3.0, '50%': 5.0, '75%': 8.0, '90%': 9.0, '95%': 9.0, 'Max': 10.0}, {'Name': 'Balance', 'Count': 8000.0, 'Mean': 76434.09651125, 'Std': 62...
Logged the following dataset metric to the ValidMind platform:
[[{'field': 'CreditScore', 'value': 1.0}, {'field': 'Geography', 'value': 0.010103440458197478}, {'field': 'Gender', 'value': 0.008251776778083898}, {'field': 'Age', 'value': -0.007269780957496768}, {'field': 'Tenure', 'value': -0.006914675142663373}, {'field': 'NumOfProducts', 'value': 0.005677094521946256}, {'field': 'HasCrCard', 'value': -0.009291152528707963}, {'field': 'IsActiveMember', 'value': 0.030554141043824444}, {'field': 'Exited', 'value': -0.025533166369817405}], [{'field': 'CreditScore', 'value': 0.010103440458197478}, {'field': 'Geography', 'value': 1.0}, {'field': 'Gender', 'value': 0.035023152881466464}, {'field': 'Age', 'value': 0.053602289473512775}, {'field': 'Tenure', 'value': 0.015510338111733172}, {'field': 'NumOfProducts', 'value': 0.011118424429054087}, {'field': 'HasCrCard', 'value': 0.021747611293409512}, {'field': 'IsActiveMember', 'value': 0.02017951122934769}, {'field': 'Exited', 'value': 0.1784101181767361}], [{'field': 'CreditScore', 'value': 0.008251776778083898}, {'field': 'G...
Results for Tabular Data Quality Test Plan:
Logged the following test result to the ValidMind platform:
Logged the following test result to the ValidMind platform:
Logged the following test result to the ValidMind platform:
Logged the following test result to the ValidMind platform:
Logged the following test result to the ValidMind platform:
Logged the following test result to the ValidMind platform:
Logged the following test result to the ValidMind platform:
Logged the following test result to the ValidMind platform:
Finding all test plans available in the developer framework
We can find all the test plans available in the developer framework by calling the following functions:
- All test plans: `vm.test_plans.list_plans()`
- Describe a test plan: `vm.test_plans.describe_plan("tabular_dataset")`
- List all available tests: `vm.test_plans.list_tests()`
As an example, here’s the output of `list_plans()` and `list_tests()`:
vm.test_plans.list_plans()
| ID | Name | Description |
|---|---|---|
| binary_classifier_metrics | BinaryClassifierMetrics | Test plan for sklearn classifier metrics |
| binary_classifier_validation | BinaryClassifierPerformance | Test plan for sklearn classifier models |
| binary_classifier_model_diagnosis | BinaryClassifierDiagnosis | Test plan for sklearn classifier model diagnosis tests |
| binary_classifier | BinaryClassifier | Test plan for sklearn classifier models that includes both metrics and validation tests |
| tabular_dataset | TabularDataset | Test plan for generic tabular datasets |
| tabular_dataset_description | TabularDatasetDescription | Test plan to extract metadata and descriptive statistics from a tabular dataset |
| tabular_data_quality | TabularDataQuality | Test plan for data quality on tabular datasets |
| normality_test_plan | NormalityTestPlan | Test plan to perform normality tests. |
| autocorrelation_test_plan | AutocorrelationTestPlan | Test plan to perform autocorrelation tests. |
| seasonality_test_plan | SesonalityTestPlan | Test plan to perform seasonality tests. |
| unit_root | UnitRoot | Test plan to perform unit root tests. |
| stationarity_test_plan | StationarityTestPlan | Test plan to perform stationarity tests. |
| timeseries | TimeSeries | Test plan for time series statsmodels that includes both metrics and validation tests |
| time_series_data_quality | TimeSeriesDataQuality | Test plan for data quality on time series datasets |
| time_series_dataset | TimeSeriesDataset | Test plan for time series datasets |
| time_series_univariate | TimeSeriesUnivariate | Test plan to perform time series univariate analysis. |
| time_series_multivariate | TimeSeriesMultivariate | Test plan to perform time series multivariate analysis. |
| time_series_forecast | TimeSeriesForecast | Test plan to perform time series forecast tests. |
| regression_model_performance | RegressionModelPerformance | Test plan for performance metric of regression model of statsmodels library |
| regression_models_comparison | RegressionModelsComparison | Test plan for metrics comparison of regression model of statsmodels library |
vm.test_plans.list_tests()
| Test Type | ID | Name | Description |
|---|---|---|---|
| Custom Test | dataset_metadata | DatasetMetadata | Custom class to collect a set of descriptive statistics for a dataset. This class will log dataset metadata via `log_dataset` instead of a metric. Dataset metadata is necessary to initialize dataset object that can be related to different metrics and test results |
| Metric | acf_pacf_plot | ACFandPACFPlot | Plots ACF and PACF for a given time series dataset. |
| Metric | adf | ADF | Augmented Dickey-Fuller unit root test for establishing the order of integration of time series |
| Metric | accuracy | AccuracyScore | Accuracy Score |
| Metric | auto_ar | AutoAR | Automatically detects the AR order of a time series using both BIC and AIC. |
| Metric | auto_ma | AutoMA | Automatically detects the MA order of a time series using both BIC and AIC. |
| Metric | auto_seasonality | AutoSeasonality | Automatically detects the optimal seasonal order for a time series dataset using the seasonal_decompose method. |
| Metric | auto_stationarity | AutoStationarity | Automatically detects stationarity for each time series in a DataFrame using the Augmented Dickey-Fuller (ADF) test. |
| Metric | box_pierce | BoxPierce | The Box-Pierce test is a statistical test used to determine whether a given set of data has autocorrelations that are different from zero. |
| Metric | csi | CharacteristicStabilityIndex | Characteristic Stability Index between two datasets |
| Metric | confusion_matrix | ConfusionMatrix | Confusion Matrix |
| Metric | dickey_fuller_gls | DFGLSArch | Dickey-Fuller GLS unit root test for establishing the order of integration of time series |
| Metric | dataset_correlations | DatasetCorrelations | Extracts the correlation matrix for a dataset. The following coefficients are calculated: - Pearson's R for numerical variables - Cramer's V for categorical variables - Correlation ratios for categorical-numerical variables |
| Metric | dataset_description | DatasetDescription | Collects a set of descriptive statistics for a dataset |
| Metric | dataset_split | DatasetSplit | Attempts to extract information about the dataset split from the provided training, test and validation datasets. |
| Metric | descriptive_statistics | DescriptiveStatistics | Collects a set of descriptive statistics for a dataset, both for numerical and categorical variables |
| Metric | engle_granger_coint | EngleGrangerCoint | Test for cointegration between pairs of time series variables in a given dataset using the Engle-Granger test. |
| Metric | f1_score | F1Score | F1 Score |
| Metric | jarque_bera | JarqueBera | The Jarque-Bera test is a statistical test used to determine whether a given set of data follows a normal distribution. |
| Metric | kpss | KPSS | Kwiatkowski-Phillips-Schmidt-Shin (KPSS) unit root test for establishing the order of integration of time series |
| Metric | kolmogorov_smirnov | KolmogorovSmirnov | The Kolmogorov-Smirnov metric is a statistical test used to determine whether a given set of data follows a normal distribution. |
| Metric | ljung_box | LJungBox | The Ljung-Box test is a statistical test used to determine whether a given set of data has autocorrelations that are different from zero. |
| Metric | lagged_correlation_heatmap | LaggedCorrelationHeatmap | Generates a heatmap of correlations between the target variable and the lags of independent variables in the dataset. |
| Metric | lilliefors_test | Lilliefors | The Lilliefors test is a statistical test used to determine whether a given set of data follows a normal distribution. |
| Metric | model_metadata | ModelMetadata | Custom class to collect the following metadata for a model: - Model architecture - Model hyperparameters - Model task type |
| Metric | model_prediction_ols | ModelPredictionOLS | Calculates and plots the model predictions for each of the models |
| Metric | pfi | PermutationFeatureImportance | Permutation Feature Importance |
| Metric | phillips_perron | PhillipsPerronArch | Phillips-Perron (PP) unit root test for establishing the order of integration of time series |
| Metric | psi | PopulationStabilityIndex | Population Stability Index between two datasets |
| Metric | pr_curve | PrecisionRecallCurve | Precision Recall Curve |
| Metric | precision | PrecisionScore | Precision Score |
| Metric | roc_auc | ROCAUCScore | ROC AUC Score |
| Metric | roc_curve | ROCCurve | ROC Curve |
| Metric | recall | RecallScore | Recall Score |
| Metric | | RegressionModelInsampleComparison | Test that output the comparison of stats library regression models. |
| Metric | | RegressionModelOutsampleComparison | Test that evaluates the performance of different regression models on a separate test dataset that was not used to train the models. |
| Metric | | RegressionModelSummary | Test that output the summary of regression models of statsmodel library. |
| Metric | residuals_visual_inspection | ResidualsVisualInspection | Log plots for visual inspection of residuals |
| Metric | rolling_stats_plot | RollingStatsPlot | This class provides a metric to visualize the stationarity of a given time series dataset by plotting the rolling mean and rolling standard deviation. The rolling mean represents the average of the time series data over a fixed-size sliding window, which helps in identifying trends in the data. The rolling standard deviation measures the variability of the data within the sliding window, showing any changes in volatility over time. By analyzing these plots, users can gain insights into the stationarity of the time series data and determine if any transformations or differencing operations are required before applying time series models. |
| Metric | runs_test | RunsTest | The runs test is a statistical test used to determine whether a given set of data has runs of positive and negative values that are longer than expected under the null hypothesis of randomness. |
| Metric | | SHAPGlobalImportance | SHAP Global Importance |
| Metric | scatter_plot | ScatterPlot | Generates a visual analysis of data by plotting a scatter plot matrix for all columns in the dataset. The input dataset can have multiple columns (features) if necessary. |
| Metric | seasonal_decompose | SeasonalDecompose | Calculates seasonal_decompose metric for each of the dataset features |
| Metric | shapiro_wilk | ShapiroWilk | The Shapiro-Wilk test is a statistical test used to determine whether a given set of data follows a normal distribution. |
| Metric | spread_plot | SpreadPlot | This class provides a metric to visualize the spread between pairs of time series variables in a given dataset. By plotting the spread of each pair of variables in separate figures, users can assess the relationship between the variables and determine if any cointegration or other time series relationships exist between them. |
| Metric | time_series_histogram | TimeSeriesHistogram | Generates a visual analysis of time series data by plotting the histogram. The input dataset can have multiple time series if necessary. In this case we produce a separate plot for each time series. |
| Metric | time_series_line_plot | TimeSeriesLinePlot | Generates a visual analysis of time series data by plotting the raw time series. The input dataset can have multiple time series if necessary. In this case we produce a separate plot for each time series. |
| Metric | zivot_andrews | ZivotAndrewsArch | Zivot-Andrews unit root test for establishing the order of integration of time series |
| ThresholdTest | class_imbalance | ClassImbalance | The class imbalance test measures the disparity between the majority class and the minority class in the target column. |
| ThresholdTest | duplicates | Duplicates | The duplicates test measures the number of duplicate rows found in the dataset. If a primary key column is specified, the dataset is checked for duplicate primary keys as well. |
| ThresholdTest | cardinality | HighCardinality | The high cardinality test measures the number of unique values found in categorical columns. |
| ThresholdTest | pearson_correlation | HighPearsonCorrelation | Test that the pairwise Pearson correlation coefficients between the features in the dataset do not exceed a specified threshold. |
| ThresholdTest | accuracy_score | MinimumAccuracy | Test that the model's prediction accuracy on a dataset meets or exceeds a predefined threshold. |
| ThresholdTest | f1_score | MinimumF1Score | Test that the model's F1 score on the validation dataset meets or exceeds a predefined threshold. |
| ThresholdTest | roc_auc_score | MinimumROCAUCScore | Test that the model's ROC AUC score on the validation dataset meets or exceeds a predefined threshold. |
| ThresholdTest | missing | MissingValues | Test that the number of missing values in the dataset across all features is less than a threshold |
| ThresholdTest | overfit_regions | OverfitDiagnosis | Test that identify overfit regions with high residuals by histogram slicing techniques. |
| ThresholdTest | robustness | RobustnessDiagnosis | Test robustness of model by perturbing the features column values |
| ThresholdTest | skewness | Skewness | The skewness test measures the extent to which a distribution of values differs from a normal distribution. A positive skew describes a longer tail of values in the right and a negative skew describes a longer tail of values in the left. |
| ThresholdTest | time_series_frequency | TimeSeriesFrequency | Test that detect frequencies in the data |
| ThresholdTest | time_series_missing_values | TimeSeriesMissingValues | Test that the number of missing values is less than a threshold |
| ThresholdTest | time_series_outliers | TimeSeriesOutliers | Test that find outliers for time series data using the z-score method |
| ThresholdTest | zeros | TooManyZeroValues | The zeros test finds columns that have too many zero values. |
| ThresholdTest | training_test_degradation | TrainingTestDegradation | Test that the degradation in performance between the training and test datasets does not exceed a predefined threshold. |
| ThresholdTest | unique | UniqueRows | Test that the number of unique rows is greater than a threshold |
| ThresholdTest | weak_spots | WeakspotsDiagnosis | Test that identify weak regions with high residuals by histogram slicing techniques. |
Preparing the dataset for training
Before we train a model, we need to run some common minimal feature selection and engineering steps on the dataset:
- Dropping irrelevant variables
- Encoding categorical variables
Dropping irrelevant variables
The following variables will be dropped from the dataset:
- `RowNumber`: a unique identifier for the record
- `CustomerId`: a unique identifier for the customer
- `Surname`: no predictive power for this variable
- `CreditScore`: we didn’t observe any correlation between `CreditScore` and our target column `Exited`
df.drop(["RowNumber", "CustomerId", "Surname", "CreditScore"], axis=1, inplace=True)
Encoding categorical variables
We will apply one-hot or dummy encoding to the following variables:
- `Geography`: only 3 unique values found in the dataset
- `Gender`: convert from string to integer
genders = {"Male": 0, "Female": 1}
df.replace({"Gender": genders}, inplace=True)
df = pd.concat([df, pd.get_dummies(df["Geography"], prefix="Geography")], axis=1)
df.drop("Geography", axis=1, inplace=True)
We are now ready to train our model with the preprocessed dataset:
df.head()
| | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_France | Geography_Germany | Geography_Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 | 1 | 0 | 0 |
| 1 | 1 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 0 | 1 |
| 2 | 1 | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 | 1 | 0 | 0 |
| 3 | 1 | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 | 1 | 0 | 0 |
| 4 | 1 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 | 0 | 0 | 1 |
Dataset preparation
For training our model, we will randomly split the dataset in 3 parts:
- `training` split with 60% of the rows
- `validation` split with 20% of the rows
- `test` split with 20% of the rows
The test dataset will be our held out dataset for model evaluation.
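The arithmetic behind the two-step split used below (hold out 20% first, then take 25% of the remaining 80%, which is 20% of the total) can be sanity-checked on a toy index:

```python
from sklearn.model_selection import train_test_split

toy = list(range(100))

# Step 1: hold out 20% for the test set
rest, test = train_test_split(toy, test_size=0.20, random_state=0)

# Step 2: 25% of the remaining 80% = 20% of the total, for validation
train, val = train_test_split(rest, test_size=0.25, random_state=0)

print(len(train), len(val), len(test))  # 60 20 20
```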
train_df, test_df = train_test_split(df, test_size=0.20)
# This guarantees a 60/20/20 split
train_ds, val_ds = train_test_split(train_df, test_size=0.25)
# For training
x_train = train_ds.drop("Exited", axis=1)
y_train = train_ds.loc[:, "Exited"].astype(int)
x_val = val_ds.drop("Exited", axis=1)
y_val = val_ds.loc[:, "Exited"].astype(int)
# For testing
x_test = test_df.drop("Exited", axis=1)
y_test = test_df.loc[:, "Exited"].astype(int)
Model training
We will train a simple XGBoost model and set its eval_set to [(x_train, y_train), (x_val, y_val)] in order to collect validation metrics on every boosting round. The ValidMind library supports collecting any type of “in training” metric so model developers can provide additional context to model validators if necessary.
model = xgb.XGBClassifier(early_stopping_rounds=10)
model.set_params(
eval_metric=["error", "logloss", "auc"],
)
model.fit(
x_train,
y_train,
eval_set=[(x_train, y_train), (x_val, y_val)],
verbose=False,
)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, early_stopping_rounds=10,
enable_categorical=False, eval_metric=['error', 'logloss', 'auc'],
feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=None, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=100, n_jobs=None,
num_parallel_tree=None, predictor=None, random_state=None, ...)
y_pred = model.predict_proba(x_val)[:, -1]
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_val, predictions)
print(f"Accuracy: {accuracy}")
Accuracy: 0.883125
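Rounding the positive-class probability at 0.5, as done above, is equivalent to calling predict directly on a binary classifier. A quick check with a stand-in scikit-learn model (LogisticRegression on synthetic data, purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data as a stand-in for the churn dataset
X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Same pattern as above: take the positive-class probability and round it
proba_positive = clf.predict_proba(X)[:, -1]
rounded = [round(p) for p in proba_positive]

print(all(r == p for r, p in zip(rounded, clf.predict(X))))  # True
```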
Running a model evaluation test plan
We will now run a basic model evaluation test plan that is compatible with the model we have trained. Since we trained an XGBoost model with a scikit-learn-compatible API, we will use the binary_classifier test plan. This test plan collects model metadata and metrics, and runs a variety of model evaluation tests according to the modeling objective (binary classification in this example).
The following model metadata is collected:
- Model framework and architecture (e.g. XGBoost, Random Forest, Logistic Regression, etc.)
- Model task details (e.g. binary classification, regression, etc.)
- Model hyperparameters (e.g. number of trees, max depth, etc.)
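For scikit-learn-compatible models (XGBClassifier included), hyperparameters of this kind are exposed through the standard get_params method, which is presumably how such metadata can be collected. A quick illustration with a scikit-learn model:

```python
from sklearn.linear_model import LogisticRegression

# Any scikit-learn-compatible estimator works the same way
model = LogisticRegression(C=0.5, max_iter=200)

# get_params() returns every constructor hyperparameter as a dict
params = model.get_params()
print(params["C"], params["max_iter"])  # 0.5 200
```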
The model metrics that are collected depend on the model type, use case, etc. For example, for a binary classification model, the following metrics could be collected (again, depending on configuration):
- AUC
- Error rate
- Logloss
- Feature importance
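Each of these metrics is available directly from scikit-learn; here is a minimal sketch on synthetic predictions (the labels and scores below are made up for illustration):

```python
from sklearn.metrics import roc_auc_score, log_loss, accuracy_score

y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # predicted positive-class probabilities
y_pred  = [round(s) for s in y_score]       # hard labels at a 0.5 threshold

auc = roc_auc_score(y_true, y_score)        # ranking quality of the scores
error_rate = 1 - accuracy_score(y_true, y_pred)
loss = log_loss(y_true, y_score)            # penalizes confident wrong probabilities

print(round(auc, 3), round(error_rate, 3), round(loss, 3))
```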
Similarly, different model evaluation tests are run depending on the model type, use case, etc. For example, for a binary classification model, the following tests could be executed:
- Simple training/test overfit test
- Training/test performance degradation
- Baseline test dataset performance test
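The degradation check, for instance, reduces to comparing the same metric across splits against a tolerance. An illustrative sketch (the 5% tolerance is an assumption for this example, not ValidMind's default):

```python
def performance_degradation(train_score: float, test_score: float,
                            max_drop: float = 0.05) -> bool:
    """Return True when the test score stays within `max_drop` of the train score."""
    return (train_score - test_score) <= max_drop

print(performance_degradation(0.91, 0.88))  # True: 3% drop is within tolerance
print(performance_degradation(0.91, 0.80))  # False: 11% drop exceeds tolerance
```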
Initialize VM model object and train/test datasets
In order to run the binary_classifier test plan, we need to initialize ValidMind object instances for the trained model and the training and test datasets:
vm_train_ds = vm.init_dataset(
dataset=train_df,
type="generic",
target_column="Exited"
)
vm_test_ds = vm.init_dataset(
dataset=test_df,
type="generic",
target_column="Exited"
)
vm_model = vm.init_model(
model,
train_ds=vm_train_ds,
test_ds=vm_test_ds,
)
Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...
Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...
We can now run the binary_classifier test plan:
model_plan = vm.run_test_plan("binary_classifier", model=vm_model)
Results for Binary Classifier Metrics Test Plan:
Logged the following model metric to the ValidMind platform:
{'architecture': 'Extreme Gradient Boosting', 'task': 'classification', 'subtask': 'binary', 'framework': 'XGBoost', 'framework_version': '1.7.5', 'language': 'Python 3.8.13', 'params': {'objective': 'binary:logistic', 'base_score': None, 'booster': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'eval_metric': ['error', 'logloss', 'auc'], 'gamma': None, 'gpu_id': None, 'grow_policy': None, 'interaction_constraints': None, 'learning_rate': None, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': None, 'max_leaves': None, 'min_child_weight': None, 'monotone_constraints': None, 'n_jobs': None, 'num_parallel_tree': None, 'predictor': None, 'random_state': None, 'reg_alpha': None, 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': None}}
Logged the following dataset metric to the ValidMind platform:
{'total_size': 8000}
Logged the following evaluation metric to the ValidMind platform:
0.85375
Logged the following evaluation metric to the ValidMind platform:
{'tn': 1221, 'fp': 54, 'fn': 180, 'tp': 145}
Logged the following evaluation metric to the ValidMind platform:
0.5534351145038168
Logged the following training metric to the ValidMind platform:
{'Gender': ([0.0031562499999999495], [0.0008466385740090111]), 'Age': ([0.09284374999999996], [0.004049739574960349]), 'Tenure': ([0.006312499999999943], [0.0010670856924352328]), 'Balance': ([0.032624999999999946], [0.0028391102259334882]), 'NumOfProducts': ([0.07334374999999997], [0.0030500384218891353]), 'HasCrCard': ([0.0011249999999999316], [0.000387802301437223]), 'IsActiveMember': ([0.039249999999999986], [0.0016488632447841187]), 'EstimatedSalary': ([0.00868749999999996], [0.0006745947857788526]), 'Geography_France': ([0.001687499999999953], [0.0004571480886977634]), 'Geography_Germany': ([0.021437499999999977], [0.0017382417481466934]), 'Geography_Spain': ([0.0008437499999999875], [0.0003217384419058682])}
Logged the following evaluation metric to the ValidMind platform:
{'precision': array([0.203125 , 0.20325203, 0.20337922, ..., 1. , 1. ,
1. ]), 'recall': array([1. , 1. , 1. , ..., 0.00615385, 0.00307692,
0. ]), 'thresholds': array([0.00810315, 0.00885704, 0.00898217, ..., 0.980356 , 0.9822782 ,
0.9835564 ], dtype=float32)}
Logged the following evaluation metric to the ValidMind platform:
0.7286432160804021
Logged the following evaluation metric to the ValidMind platform:
0.4461538461538462
Logged the following evaluation metric to the ValidMind platform:
0.7019004524886877
Logged the following evaluation metric to the ValidMind platform:
{'auc': 0.7019004524886877, 'fpr': array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 7.84313725e-04,
7.84313725e-04, 1.56862745e-03, 1.56862745e-03, 2.35294118e-03,
2.35294118e-03, 3.92156863e-03, 3.92156863e-03, 4.70588235e-03,
4.70588235e-03, 5.49019608e-03, 5.49019608e-03, 6.27450980e-03,
6.27450980e-03, 7.05882353e-03, 7.05882353e-03, 8.62745098e-03,
8.62745098e-03, 9.41176471e-03, 9.41176471e-03, 1.09803922e-02,
1.09803922e-02, 1.17647059e-02, 1.33333333e-02, 1.33333333e-02,
1.41176471e-02, 1.41176471e-02, 1.56862745e-02, 1.56862745e-02,
1.64705882e-02, 1.64705882e-02, 1.88235294e-02, 1.88235294e-02,
1.96078431e-02, 1.96078431e-02, 2.11764706e-02, 2.11764706e-02,
2.19607843e-02, 2.19607843e-02, 2.27450980e-02, 2.27450980e-02,
2.50980392e-02, 2.50980392e-02, 2.58823529e-02, 2.58823529e-02,
2.66666667e-02, 2.66666667e-02, 2.74509804e-02, 2.74509804e-02,
2.90196078e-02, 2.90196078e-02, 2.98039216e-02, 2.98039216e...
Logged the following training metric to the ValidMind platform:
{'Gender': 7.3e-05, 'Age': 0.000503, 'Tenure': 0.000993, 'Balance': 0.000763, 'NumOfProducts': 6.3e-05, 'HasCrCard': 8.6e-05, 'IsActiveMember': 5.4e-05, 'EstimatedSalary': 0.00066, 'Geography_France': 0.0, 'Geography_Germany': 3.4e-05, 'Geography_Spain': 4.1e-05}
Logged the following training metric to the ValidMind platform:
| bin | initial | percent_initial | new | percent_new | psi |
|---|---|---|---|---|---|
| 1 | 3494 | 0.545938 | 861 | 0.538125 | 0.000113 |
| 2 | 1018 | 0.159062 | 278 | 0.173750 | 0.001297 |
| 3 | 467 | 0.072969 | 124 | 0.077500 | 0.000273 |
| 4 | 300 | 0.046875 | 73 | 0.045625 | 0.000034 |
| 5 | 235 | 0.036719 | 62 | 0.038750 | 0.000109 |
| 6 | 156 | 0.024375 | 36 | 0.022500 | 0.000150 |
| 7 | 168 | 0.026250 | 44 | 0.027500 | 0.000058 |
| 8 | 161 | 0.025156 | 29 | 0.018125 | 0.002305 |
| 9 | 142 | 0.022188 | 34 | 0.021250 | 0.000040 |
| 10 | 259 | 0.040469 | 59 | 0.036875 | 0.000334 |